Immersive environments can facilitate the virtual 
interaction between people, objects, places, and databases. Immersion has several varied 
practical applications. It can serve as an aid to engineering applications. Immersion can also 
be used to understand and aid the disabled. These environments result in the production of 
large amounts of data for transmission and storage. Data types such as images, audio, video, 
and text are an integral part of immersive environments and many researchers in the past 
have addressed their management. However, we have identified a set of less familiar data 
types, collectively termed immersidata (Shahabi, Barish, Ellenberger, Jiang, Kolahdouzan, 
Nam, & Zimmermann, 1999) that are specific to immersive environments. Immersidata are 
produced as a result of the user's interactions with an immersive environment. 

Haptic data is a kind of immersidata that is used to describe the movement, rotation, 
and force associated with user-directed objects in an immersive environment. We use the 
CyberGlove as a haptic user interface to an immersive environment. The CyberGlove con- 
sists of several sensory devices that generate data at a continuous rate. The acquired data can 
be stored, queried, and analyzed for several applications. 

In this chapter, we focus our attention on the analysis of haptic data with the objective 
of modeling these data in a database. A large number of diverse applications use haptic data. 
Each such application may need haptic data stored and modeled at different levels of ab- 
straction. For now, we consider three levels of abstraction. First, in Shahabi, Barish, Ko- 
lahdouzan, Yao, Zimmermann, and Zhang (2001), we made our first attempt to model haptic 
data at the lowest level of abstraction. There, we dealt with raw haptic data conceptualized 
as time-series data sets. Such a modeling approach can be used for training applications such 
as comparing a teacher's and a student's session with the CyberGlove, to measure the stu- 
dent's proficiency at following the teacher. Second, in this chapter we move a level up from 
our previous work in using raw haptic data by trying to understand the semantics of hand 
actions, and we employ several learning techniques to develop this understanding. The application 
that we focus on is limited-vocabulary American Sign Language recognition that 
involves the translation of American Sign Language (ASL) to spoken words. Finally, for the 
third level of abstraction, there exists a class of applications that need to analyze preproc- 
essed data, as opposed to analyzing raw haptic data. An example would be the application of 
detecting the grasping behavior of the hand. This application might need the speed of the 
hand in space at a certain instant of time. We intend to study this final level of abstraction as 
part of our future work. 

We analyzed the raw haptic data acquired from the CyberGlove to recognize different 
hand signs automatically. We investigated three different analysis techniques and evaluated 
their accuracy over a 10-sign vocabulary. First, Decision Tree was used for the supervised 
classification of haptic data. This technique generates decision trees derived from a particu- 
lar data set. In particular, we used C4.5 Decision Tree to classify haptic data into static signs. 
Our experiments show that this technique can classify haptic data with an average error of 
22%. Second, we utilized Bayesian Classifier for the classification of haptic data. Bayesian 
Classifier is a fast supervised classification technique that generates fine-grained probability 
estimates over the data set. In our experiments, this technique resulted in an average error of 
15.34%. Bayesian Classifier appears to be the fastest classification technique providing the 
best classification accuracy for our experiments. Finally, we used Supervised Neural Net- 
works as a classification technique for the recognition of both static and dynamic signs. We 
show that with Neural Networks, static signs can be recognized with an average error of 
20.18%. 

Our research is distinct and novel in the following three respects. To begin with, we 
are distinct with respect to the framework we have used for our research and experiments. 
The framework is based upon the environment provided by the CyberGlove from Immersion 
Corporation. All our analysis and experiments were performed on raw haptic data without 
any kind of preprocessing. The analyzed data sets were collected and recorded by us, using 
an application developed at our laboratory. In addition to a novel framework, we have taken 
a new approach to modeling haptic data, which is based upon various learning techniques. 
To the best of our knowledge, we are the first to use Decision Trees for the analysis of raw 
haptic data. Bayesian Classifier was used in the past for the analysis of preprocessed haptic 
data; however, as far as we know this classification technique was never used for the analysis 
of raw haptic data. Neural Networks have been used for the analysis of haptic data in many 
research efforts in the past. Our contributions to the analysis of haptic data using Neural 
Networks include the use of Back-Propagation Neural Networks for static sign recognition 
and the use of Time-Delay Neural Networks for dynamic sign recognition. The comparison 
of these three techniques within the same environment and experimental setup is also novel 
and unique. Finally, the ultimate objective of our research is to model and store haptic data 
at different levels of abstraction. Consequently, each kind of application can use haptic data 
stored at the level of abstraction that is most suited to its analytical needs. 

The remainder of this chapter is organized as follows. First, we describe how we ac- 
quired haptic data using the CyberGlove. Our proposed techniques, fixed sampling, group 
sampling, and adaptive sampling, allow us to also take the time dimension into considera- 
tion, which can be used for the analysis of haptic data for dynamic sign recognition. Next we 
explain the three learning techniques that we used for sign recognition. The results of our 
experiments in comparing the three analysis techniques are then reported. Finally, we cover 
other research efforts in sign language recognition and outline our future research plans. 


DATA ACQUISITION 


The development of haptic devices is in its infancy. We have focused our research and 
experiments on the CyberGrasp exoskeletal interface and accompanying CyberGlove, which 
consists of 33 sensors (Table 14-1). We use the CyberGrasp SDK to write handlers to record 
sensor data for our experiments whenever a sampling interrupt occurs. 

The rate at which these handlers are called is thus the maximum rate at which we can 
sample the input signal, and it is a function of the CPU speed. We developed a multi- 
threaded double buffering technique to sample and record data asynchronously. One thread 
is associated with responding to the handler call and copying sensor data into a region of 
system memory. A second thread asynchronously writes this data to disk. The CPU was 
never 100% utilized during this process. This prevents our recording process from interfer- 
ing with the rendering process. There is obvious room for optimization here, as we can run 
our experiments on a dual processor machine and adjust the priority for the second thread. 
We used 10 letters (A to I and L) from ASL for our experiment. We term each of these 10 
letters a sign. The 22 sensor values (excluding sensors 23 to 33 in Table 14-1) are recorded 
in a log file for each sign made by a subject, termed as a session. Each session log file con- 
tains thousands of rows of sensor values sampled at some frequency, which depends on the 
sampling technique used. We denote each such row as a snapshot. We thus have thousands 
of snapshots for each session. 
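
The double-buffering scheme described above can be sketched as follows. This is only an illustrative sketch, not the recording application used in the experiments: the class name `DoubleBufferRecorder`, the `on_sample` handler, and the in-memory `sink` list (standing in for the log file) are all hypothetical, and the CyberGrasp SDK itself is not called. For simplicity, filled buffers wait in a small queue rather than in exactly one spare buffer.

```python
import threading


class DoubleBufferRecorder:
    """The handler fills one in-memory buffer while a writer thread drains full ones."""

    def __init__(self, sink, buffer_size=4):
        self._active = []                 # buffer currently filled by the handler
        self._full = []                   # filled buffers waiting for the writer
        self._buffer_size = buffer_size
        self._sink = sink                 # hypothetical stand-in for the log file
        self._cond = threading.Condition()
        self._closed = False
        self._writer = threading.Thread(target=self._drain)
        self._writer.start()

    def on_sample(self, snapshot):
        """Called by the (hypothetical) sampling handler; only copies into memory."""
        with self._cond:
            self._active.append(list(snapshot))
            if len(self._active) >= self._buffer_size:
                self._swap()

    def _swap(self):
        # Caller must hold the lock: hand the filled buffer to the writer.
        self._full.append(self._active)
        self._active = []
        self._cond.notify()

    def close(self):
        """Flush the partially filled buffer and wait for the writer to finish."""
        with self._cond:
            if self._active:
                self._swap()
            self._closed = True
            self._cond.notify()
        self._writer.join()

    def _drain(self):
        while True:
            with self._cond:
                while not self._full and not self._closed:
                    self._cond.wait()
                if not self._full:
                    return                # closed and nothing left to write
                batch = self._full.pop(0)
            self._sink.extend(batch)      # slow I/O happens outside the lock
```

Because the handler only appends to memory and the slow write happens outside the lock, the sampling path is not held up by disk I/O, which is the property the text relies on.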

Table 14-1: CyberGrasp sensors. 


Sampling Techniques 


To record several snapshots for each static sign made within a session, we need to 
sample the values of sensors for each subject making a sign. Moreover, ASL is not restricted 
to just static signs. It has some dynamism in signs (e.g., the letter ‘J’ involves moving the 
hand) and in words (e.g., ‘BOX’ is represented by depicting a rectangular shape). Hence, the 
time dimension needs to be considered while recording the data. Thus for both static and 
dynamic signs, the time at which each sensor is sampled impacts the storage and the exact 
representation of the data. If a sensor value is recorded too frequently, then we will obvi- 
ously get a very accurate representation of the sensor data, but the storage requirements and 
the transmission requirements increase. On the other hand, if the sensor value is recorded 
intermittently, we would save storage space but at the same time, we would run the risk of 
recording an inaccurate representation of the data. Thus sampling the sensors at the rate that 
would lead to lower storage space requirements and better accuracy is central to the task of 
data acquisition for any haptic device. We designed and implemented the following sam- 
pling techniques for our experiments (see Shahabi et al., 2001, for details): 


1. Fixed Sampling: Fixed sampling can be approached in two ways. One ap- 
proach is to use the maximum sampling rate rmax allowed by the Software 
Development Kit. While this technique is easy to implement, it is wasteful 
since it records data for each sensor at each possible opportunity regardless of 
the sensor type or the semantics of the session. A more efficient method involves 
finding the minimum sampling rate r0 required for the entire sensor set 
and then using that as the sampling rate. The disadvantage to this approach is 
that we need to identify r0 before we start sampling at that rate. In our ex- 
periments, rmax was 80 Hz and r0 was found to be 67 Hz. 


2. Group Sampling: The intuition for group sampling is that devices such as 
the CyberGrasp have different sensors that can be mapped to groups (e.g., all 
joints of a finger). We can isolate a sampling rate for each group and acquire 
data at different rates, based upon the group membership for each sensor. The 
advantage of this technique is its improvement over the fixed sampling tech- 
nique by further reducing storage space and transmission requirements while 
maintaining accuracy. The difficulty in pursuing the grouped sampling strat- 
egy is in identifying the groups. Our intuition about natural groups may not be 
correct all the time. 


3. Adaptive Sampling: This is a dynamic form of sampling where we try to find 
an optimum rate r_ij for each sensor i during a given window j of the session. 
The obvious advantage is the optimality of this approach. Adaptive sampling 
reduces bandwidth and storage requirements to far lower levels as compared 
to fixed or group sampling techniques. An additional benefit is that the sam- 
pling rate changes with the nature of the sessions. This makes the adaptive 
approach more efficient than the fixed or group sampling approach. The 
drawback is that it requires a complex implementation. We used a double 
buffering approach where a recording thread samples at the maximum rate 
possible and a storage thread performs the basic sampling methodology to 
identify the Nyquist sampling rates. This buffering approach means that some 
degree of real time acquisition is sacrificed. 


The adaptive sampling approach looks particularly attractive because of its efficiency 
and robustness. In Shahabi et al. (2001), we provide details on these sampling techniques 
and the various tradeoffs among factors like bandwidth, storage, and computational com- 
plexity. 
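
To make the idea of a per-window Nyquist rate concrete, the following is a minimal sketch under stated assumptions. It is not the procedure of Shahabi et al. (2001), whose details are not given here: the functions `nyquist_rate` and `adaptive_rates`, the 99% energy cutoff, and the naive DFT are all illustrative choices. The sketch finds the lowest frequency below which most of the signal energy lies and doubles it.

```python
import cmath
import math


def nyquist_rate(samples, sample_rate_hz, energy_fraction=0.99):
    """Estimate a sufficient sampling rate (Hz) for one sensor over one window."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]      # drop the constant (DC) component
    power = []
    for k in range(1, n // 2 + 1):              # naive O(n^2) DFT, fine for short windows
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        return 0.0                              # sensor did not move in this window
    cumulative = 0.0
    for k, p in enumerate(power, start=1):
        cumulative += p
        if cumulative >= energy_fraction * total:
            return 2.0 * k * sample_rate_hz / n  # twice the highest significant frequency
    return float(sample_rate_hz)


def adaptive_rates(samples, sample_rate_hz, window):
    """One rate per consecutive window, i.e. r_ij for a single sensor i."""
    return [nyquist_rate(samples[s:s + window], sample_rate_hz)
            for s in range(0, len(samples) - window + 1, window)]
```

Applying `nyquist_rate` once to the whole session and taking the maximum over all sensors corresponds to the fixed-rate r0; applying it per group or per window corresponds to the group and adaptive variants.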


CLASSIFICATION METHODS 


In this chapter we explore three different classification techniques and evaluate the ac- 
curacy of each technique to detect 10 different hand signs. The employed techniques are 
C4.5 Decision Tree, Bayesian Classifier, and Neural Networks. Each classification technique 
is implemented in two different stages, the training phase and the recognition phase. We base 
our experiments on the data obtained from the first 22 sensors, as we believe that these sen- 
sor values are the most important and they hold all the information required in detecting a 
sign. 




C4.5 Decision Tree 


Tree induction methods are considered to be supervised classification methods that 
generate decision trees derived from a particular data set. C4.5 uses the concept of informa- 
tion gain to construct a tree of classificatory decisions with respect to a previously chosen 
target classification (Quinlan, 1993). The information gain can be described as the effective 
decrease in entropy resulting from making a choice as to which attribute to use and at what 
level. In addition, the output of the system is available as a symbolic rule base, which allows 
the system developer to determine the factors that have an impact on the selection strategy 
for a given application domain. C4.5 starts with large sets of cases belonging to known 
classes of data. The cases, described by any mixture of nominal and numeric properties, are 
scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are 
then expressed as models, in the form of decision trees or sets of if-then rules, which can be 
used to classify new cases, with an emphasis on making the models understandable as well as 
accurate (Quinlan, 1993). 
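
As an illustration of the information-gain criterion, here is a minimal sketch for a single numeric attribute (for example, one joint sensor). The function names and the toy values are hypothetical. Note that C4.5 itself normalizes this quantity into a gain ratio; the sketch shows only the plain entropy decrease described above.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy decrease obtained by splitting on `value <= threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0                      # a split that separates nothing gains nothing
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values; return the best one."""
    distinct = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    if not midpoints:
        return None
    return max(midpoints, key=lambda t: information_gain(values, labels, t))
```

For four snapshots with sensor values `[-1.5, -1.2, 0.3, 0.8]` and labels `["H", "H", "B", "B"]`, a threshold between -1.2 and 0.3 separates the two signs completely, so the gain equals the full one bit of entropy in the labels.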

C4.5 consists of two major modules: decision tree maker and rule generator. At first 
the entire data set gets partitioned into smaller subspaces to construct a decision tree. However, 
for real world databases the decision trees become huge in practice. Large decision 
trees are always difficult to understand and interpret. In general, it is often possible to prune 
a decision tree to obtain a simpler and more accurate tree. However, a tree may not provide 
any significant insight into data. Figure 14-1 illustrates a rule generated by C4.5 in our ex- 
periment for the letter H. 


Rule 2: if 
    Middle outer joint > -1.0824 
    Ring outer joint <= -1.1086 
then 
    Sign is H with the probability of 37.1% 


Figure 14-1: Sample rule for the letter “H.” 


We employed C4.5 Decision Tree because it provides a model to build a sign recogni- 
tion language. In addition, decision trees in general and C4.5 in particular provide results as 
a set of understandable and interpretable rules. Finally, C4.5 has been used as a benchmark 
in several other papers on machine learning, artificial intelligence, and data mining. 

C4.5 complexity is O(nt), where t is the number of tree nodes, and the number of tree 
nodes often grows as O(n), where n is the number of sessions. The complexity for non-numeric 
data would therefore be O(n^2), for numeric data O(n^2 log n), and for mixed-type data, 
somewhere in between. 


Bayesian Classification 


Bayesian Classifier is a fast supervised classification technique. This method is able to 
reduce the risk of various hypotheses about the patterns of missing data. Furthermore, it re- 
sults in both an accurate analysis of an incomplete database and decisions about the predic- 
tions from the data. Naive Bayesian Classification performs well if the values of the attrib- 
utes for the sessions are independent. Although this assumption is almost always violated in 
practice, recent work (Domingos & Pazzani, 1996) has shown that naive Bayesian learning 
is remarkably effective in practice and difficult to improve upon systematically. Bayesian 
Classifier is suitable for large-scale prediction and classification tasks on complex and in- 
complete datasets. We have decided to use the naive Bayesian Classifier in our application, 
for the following reasons. First, it is efficient for both the training phase and the recognition 
phase. Second, its training time is linear in the number of examples and its recognition time 
is independent of the number of examples. Finally, it provides relatively fine-grained prob- 
ability estimates that can be used to classify the new session (Elkan, 1997). 

The computational complexity of Bayesian Classification is fairly low as compared to 
other classification techniques. Consider a session with f attributes, each with v values. Then 
with the naive Bayesian classifier with e sessions, the training time is O(ef) and hence inde- 
pendent of v. 
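
The following minimal sketch shows one way a naive Bayesian classifier over continuous sensor values can be written. The Gaussian model for each attribute, the small variance floor, and the function names are assumptions made for illustration; the text does not state which attribute model was used in the experiments. Training makes a single pass over the e sessions and f attributes, which is where the O(ef) cost comes from.

```python
import math
from collections import defaultdict


def train(sessions, labels):
    """Estimate a prior plus per-attribute mean and variance for every class."""
    grouped = defaultdict(list)
    for x, y in zip(sessions, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # The 1e-6 floor (an assumption) avoids division by zero for constant attributes.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-6
                     for col, m in zip(columns, means)]
        model[y] = (n / len(sessions), means, variances)
    return model


def log_posteriors(model, x):
    """Unnormalized log P(class | x), treating attributes as independent."""
    scores = {}
    for y, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        scores[y] = score
    return scores


def classify(model, x):
    """Return the class with the highest posterior score."""
    scores = log_posteriors(model, x)
    return max(scores, key=scores.get)
```

Recognition cost depends only on the number of classes and attributes, not on how many sessions were used for training, which matches the second reason given above.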


Neural Networks 


We use Neural Networks for the recognition of both static and dynamic signs with a 
limited vocabulary. Supervised learning is being used for the classification. In this section, 
we first explain the basics of the Neural Network architecture that we used and then discuss 
the setting of its parameters for our experiments. An artificial neuron receives its inputs from 
a number of other neurons or from an external stimulus. A weighted sum of these inputs con- 
stitutes the argument to an activation function. This activation function is generally nonlinear 
(e.g., hard-limiting, sigmoid, or threshold logic). The resulting value of the activation function 
is the output of the artificial neuron in the Artificial Neural Network (ANN). This output gets distributed along 
weighted connections to other neurons. The actual manner in which the connections are 
made defines the flow of information in the network and is called the architecture of the net- 
work. Useful architectural configurations include single-layer, multilayer, feed-forward, 
feedback, and lateral connectivity. The method used to adjust weights in the process of train- 
ing the network is called the learning rule. Artificial neural systems are not programmed, 
they are taught. The learning can be supervised (e.g., back-propagation) or unsupervised 
(e.g., self-organizing maps). 

A Multilayer Perceptron (MLP) is trained using the supervised-learning rule. The most 
commonly used algorithm for such training is the error-back-propagation-algorithm. MLPs 
are feed-forward networks with one or more layers of nodes between the input and output 
layers of nodes. These additional layers contain hidden nodes that are not directly connected 
to both the input and the output nodes. The capabilities of the multilayer perceptrons stem 
from the nonlinearity used in these nodes. The number of nodes in the hidden layer must be 
large enough to form a decision region that can be as complex as required by a given prob- 
lem. A three-layer perceptron can form arbitrarily complex decision regions. Hence, usually, 
most problems can be solved by three-layer (one hidden layer) perceptrons. 

A vital attribute of any trained neural network is the ability to extract the discriminant 
information from a large number of examples. Hence, Neural Networks are ideal for com- 
plex pattern-recognition problems whose solution requires knowledge that is difficult to 
specify but which is available in the form of examples. They have been studied within a va- 
riety of applications. 

Figure 14-2: Supervised learning diagram with NN. 


Supervised Learning Rules 


Supervised learning is based upon the availability of an external teacher, as illustrated 
in Figure 14-2. The desired response represents the optimum action to be performed by the 
neural network. The network parameters are adjusted under the combined influence of the 
training vector and the error signal. This adjustment is carried out iteratively in a step-by- 
step fashion with the aim of eventually making the network emulate the teacher. When this 
condition is reached, we may remove the teacher and let the neural network deal with the 
environment thereafter entirely by itself. Figure 14-2 shows the supervised learning diagram. 
Examples of supervised learning algorithms include the back-propagation algorithm. 


The Multilayer Perceptron Error BP Learning Algorithm 


The back-propagation algorithm is an iterative algorithm designed to minimize the 
mean-squared error between the actual output of a feed-forward perceptron and the desired 
output. It requires continuous differentiable nonlinearity. The following equation assumes 
that a sigmoid logistic nonlinearity is used, where the function $f(\beta)$ is: 


$$ f(\beta) = \frac{1}{1 + e^{-\beta}} \tag{14.1} $$


The MLP back-propagation algorithm can be described as follows (Lippmann, 1987). 
First, we initialize the weights and the offsets of the neural network to small random values. 
The next step involves presenting the neural network with the input and the desired outputs. 
We present a continuous-valued input vector x(0), x(1), ..., x(n-1) and specify the desired 
outputs d(0), d(1), ..., d(m-1). Since we use the net as a classifier, all the outputs are set to 
0, except for the output that corresponds to the class to which the input belongs, which is set 
to 1. The input could be new on each trial or samples from the training set could be pre- 
sented cyclically until the weights stabilize. The complete set of training inputs is called an 
epoch. Next, we compute the actual outputs. We use the sigmoid nonlinearity from Equation 
14.1 and use the following equation to compute the output: 
The output for the subsequent hidden layer and the final layer is then computed using a 
similar equation. We recursively adjust the weights, starting from the output node, working 
back towards the hidden layer. We use the following formula to adjust the weights: 


$$ w(i)(j)(t+1) = w(i)(j)(t) + \mu \, \delta(j) \, x'(i) \tag{14.3} $$


In this equation $w(i)(j)(t)$ is the weight from hidden node $i$ or from an input to node $j$ at 
time $t$, $x'(i)$ is either the output of node $i$ or is an input, $\mu$ is a gain term, called the Learning 
Rate, and $\delta(j)$ is an error term for node $j$. We then compute the error using the following 
equation. Mean Square Error: 


$$ E = \frac{1}{2} \sum_{j=0}^{m-1} \left( d(j) - y(j) \right)^2 $$


Here $y(j)$ denotes the actual output of output node $j$. 


If the error is sufficiently small, we stop with the learning process; otherwise, we iter- 
ate through the above processes until the error is sufficiently small. 

While the algorithm works its way forward to the output layer, the error gradient is actually 
computed from the output layer backwards. Hence, it has historically been called the 
back-propagation algorithm. The Neural Network convergence is sensitive to the number, 
type, and sequence of inputs during training and the initialization of random weights. 
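
The training loop described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the class `TinyMLP`, the per-sample update schedule, and the default learning rate are hypothetical, and the threshold is modeled as an extra weight on a constant input of 1. The error terms use the standard sigmoid derivative y(1 - y).

```python
import math
import random


def sigmoid(beta):
    return 1.0 / (1.0 + math.exp(-beta))


class TinyMLP:
    """One hidden layer, sigmoid units, trained by error back-propagation."""

    def __init__(self, n_in, n_hidden, n_out, learning_rate=0.5, seed=0):
        rng = random.Random(seed)
        # The extra column in each layer holds the threshold (bias) weight.
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]
        self.mu = learning_rate

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb)))
                  for row in self.w_hidden]
        hb = hidden + [1.0]
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_out]
        return xb, hb, y

    def train_step(self, x, d):
        xb, hb, y = self.forward(x)
        # Error terms: output layer first, then propagated back to the hidden layer.
        delta_out = [yj * (1 - yj) * (dj - yj) for yj, dj in zip(y, d)]
        delta_hidden = [hb[i] * (1 - hb[i]) *
                        sum(delta_out[j] * self.w_out[j][i] for j in range(len(y)))
                        for i in range(len(hb) - 1)]
        # Weight update: w(t+1) = w(t) + mu * delta(j) * x'(i).
        for j, row in enumerate(self.w_out):
            for i in range(len(row)):
                row[i] += self.mu * delta_out[j] * hb[i]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] += self.mu * delta_hidden[j] * xb[i]

    def error(self, data):
        """Sum over examples of one half the squared output error."""
        return sum(0.5 * sum((dj - yj) ** 2
                             for dj, yj in zip(d, self.forward(x)[2]))
                   for x, d in data)
```

Presenting every training example once to `train_step` corresponds to one epoch; training stops when `error` falls below a chosen tolerance or after a fixed number of epochs.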


Implementation of Neural Network Classification over Static Signs 


We used the following parameters throughout our experimentation. These parameters 
were chosen to be optimal in given conditions and given data, over multiple runs. 
Number of Nodes: 
1. Input Layer: 22 nodes (one per sensor), plus one threshold value node for the 
next layer. 


2. Hidden Layer: The MLP used one hidden layer with 10 nodes. 
3. Output Layer: 10 nodes, each corresponding to one posture. 


Training Data: 

With each of the 23 inputs (22 haptic glove values + 1 threshold) connected to each of 
the 11 hidden layer neurons (10 neurons + 1 threshold), and again each of these hidden layer 
neurons being connected to each of the 10 output neurons, the total number of weights in the 
network is (23 x 10) + (11 x 10) = 340 weights. 

For our experiment on static signs with 340 weights, we set the cardinality of the 
training set to approximately 10 times the number of weights, to cross the "VCDim" threshold and achieve 
good generalization as propounded in Vapnik and Chervonenkis (1971). We analyzed the 
recorded log file for every subject-sign pair session and extracted 40 snapshots from each of 
these 10 subjects. Hence, we have a training data set of 10 subjects, each making 10 signs, 
and for each sign-subject pair we have 40 snapshots, resulting in 4000 sets of sensor values. 
We train the network for 500 epochs. The error rate stabilizes to two decimal places. 
We generate pseudorandom initial weights in the range -1.0 to +1.0. The data is 
affected by noise but was input to the neural network without any preprocessing except for 
normalization to the range of -1 to +1. We strove to make the neural network 
learn on raw haptic data so that it learns to handle noisy data. This can be useful when 
we try to use the classifier in real-time immersive applications. 
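
The arithmetic behind the weight count and the training-set size can be checked with a few lines. The helper names are hypothetical, and the min/max scaling in `normalize` is an assumption, since the text states only the target range of -1 to +1 and not how the bounds were chosen.

```python
def count_weights(n_inputs, n_hidden, n_outputs):
    """Weights in a one-hidden-layer MLP with one threshold node per layer."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


def normalize(values, lo, hi):
    """Linearly map values from [lo, hi] to [-1, +1] (assumed min/max scaling)."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


weights = count_weights(22, 10, 10)   # (23 x 10) + (11 x 10)
training_rows = 10 * 10 * 40          # subjects x signs x snapshots
```

With 340 weights, the rule of thumb of roughly ten examples per weight asks for about 3400 training rows, which the 4000 recorded snapshots satisfy.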

Similar work has been done in Salomon and Weissmann (1999), wherein they use all 
possible groupings of two fingers as input. This yields very good results on the training set, 
but the ability of this approach to be generalized needs to be ascertained. Our approach 
shows good promise for overall generalization. 


Theoretical Setup for Classification of Dynamic Signs 


We were preparing to classify a restricted vocabulary of dynamic signs. Each subject 
might perform the same sign with different speeds. With a fixed sampling rate (Shahabi et 
al., 2001) for each session, the chances were high that we would have the same sign repre- 
sented by a different number of samples in different sessions. This called for a way to incor- 
porate the temporal dimension into the haptic data. We proposed to use the Time-Delay 
Neural Network (TDNN) approach towards this end. 


Time-Delay Neural Network 


TDNN (Waibel, 1989) is a multilayer feed-forward network and it is trained with the 
back-propagation algorithm. We used TDNN for haptic data because it can learn and repre- 
sent relationships between events in time and it can learn complex nonlinear decision sur- 
faces, especially with high-dimensional input data. TDNN can learn inherent features in a 
manner that is invariant under translation in time. This can be achieved by feeding the input 
sequence into a tapped delay line, and then feeding the taps from the delay line into an MLP. 


Input and its Preprocessing 


Taking the time dimension into consideration results in a major bottleneck for this ar- 
chitecture, since we can unfold the sequence only over a finite period of time. The delay line 
can only be of finite extent, requiring the same number of sensor-value-sets for every input. 
We decided upon this fixed typical number ‘N’ as follows: 

N = (typical length of a gesture in seconds x sampling rate) 

With different subjects performing the signs with different speeds, it is not possible to 
have constant N for every sign at a fixed sampling rate. Hence we plan to implement stan- 
dard signal processing reparameterization techniques such as Dynamic Time Warping 
(DTW), which has been implemented earlier in the speech recognition literature (Rabiner & 
Juang, 1993) or similar techniques on glove-based inputs, discussed in Sandberg (1997). 
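
For readers unfamiliar with Dynamic Time Warping, the following is a minimal sketch of the classic dynamic-programming formulation for one-dimensional sequences. It illustrates the general technique only; the function name and the absolute-difference cost are illustrative choices, and it is not the reparameterization planned for the glove data.

```python
def dtw_distance(a, b, dist=lambda u, v: abs(u - v)):
    """Minimum cumulative cost of aligning sequence a with sequence b."""
    inf = float("inf")
    # cost[i][j] is the best alignment cost of a[:i] with b[:j].
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(cost[i - 1][j],      # a advances, b repeats
                                    cost[i][j - 1],      # b advances, a repeats
                                    cost[i - 1][j - 1])  # both advance
    return cost[len(a)][len(b)]
```

A sign performed at half speed produces each sensor value twice as often; DTW aligns the repeated samples to the same point of the faster recording, so `dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3])` is zero even though the lengths differ.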

After expanding the temporal dimension of a gesture spatially, we decided on the 
length of the input window as 2 seconds. With a fixed sampling rate of q snapshots/second, we 
shall have 2q as the length of delay line, giving (2 x q x 22) input nodes. 
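
The tapped delay line can be pictured as a sliding window that flattens the most recent snapshots into one input vector. The sketch below is illustrative only; the function name and the example rate q = 5 are hypothetical.

```python
def delay_line_inputs(snapshots, taps):
    """Flatten every run of `taps` consecutive snapshots into one input vector."""
    windows = []
    for end in range(taps, len(snapshots) + 1):
        window = snapshots[end - taps:end]
        windows.append([value for snap in window for value in snap])
    return windows


q = 5                                   # hypothetical sampling rate, snapshots/second
session = [[float(t)] * 22 for t in range(12)]
inputs = delay_line_inputs(session, taps=2 * q)
```

With a 2-second window, each vector in `inputs` has 2 x q x 22 = 220 values, one per input node, and consecutive vectors are shifted by one snapshot.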


Architecture 


We planned to use two hidden layers and one output layer apart from the input layer. 
The number of units (nodes) in the output layer as well as the second hidden layer is V, 
where V is the size of the vocabulary. We planned to experiment with the exact number for 
the first hidden layer, though intuitively it seems that seven nodes to extract apparent fea- 
tures of a palm should be appropriate. Learning proceeds analogous to the back-propagation 
algorithm discussed earlier, although the optimal parameter settings will be different. 


PERFORMANCE EVALUATION 


We conducted several experiments to evaluate and compare our three different analy- 
sis techniques for ASL recognition: C4.5 Decision Tree, Bayesian Classifier, and Neural 
Networks. Below we explain our experimental setup, the results for each method and a com- 
parison among these different techniques. 


Experimental Setup 


Fifteen subjects were selected to generate ASL signs from a given vocabulary. The 
subjects were asked to generate the following signs: A, B, C, D, E, F, G, H, I, and L, and 
data were stored in a database. The signs J and K are complicated and, taking the novice 
subjects into consideration, the signs were skipped for simplicity. We then determined the 
result of using each classification technique for ASL recognition. To evaluate each algorithm 
we used the cross-validation technique. We split the data into three sets, trained the system 
using two of the sets and conducted the tests using the third set. We implemented the test 
procedure in a round robin fashion and computed the average error (i.e., precision and recall). 
For example, if we split the data set into three different sets, denoted as set-1, set-2, and 
set-3, we go through the following steps to perform the experiments in round robin and to 
compute the average error: 


1. Train on set-1 and set-2 and test on set-3. 
2. Train on set-2 and set-3 and test on set-1. 
3. Train on set-1 and set-3 and test on set-2. 
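
The round-robin procedure above can be sketched as follows. The helper names are hypothetical, the interleaved split into folds is an assumption (the text does not say how sessions were assigned to sets), and `train_fn` and `error_fn` stand for whichever classifier and error measure are being evaluated.

```python
def round_robin_folds(items, k=3):
    """Yield (training items, held-out items) once for each of the k sets."""
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        train = [x for i, fold in enumerate(folds) if i != held_out for x in fold]
        yield train, folds[held_out]


def average_error(items, train_fn, error_fn, k=3):
    """Train on k-1 sets, test on the remaining one, and average over all k turns."""
    errors = [error_fn(train_fn(train), test)
              for train, test in round_robin_folds(items, k)]
    return sum(errors) / len(errors)
```

Every session is used for testing exactly once and for training in the other two rounds, so the averaged error reflects performance on data the classifier has not seen.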


Storage of the Input 


The Neural Network was trained using 4000 snapshots, as described earlier. This data 
was extracted from 100 session log files (10 subjects, 10 signs each, 40 different snapshots). 
The log files were produced as a result of recording the sessions. 

For our experiments on static signs, we analyzed the recorded log files stored in a da- 
tabase and extracted the snapshot that has the sensor values consistent over a substantial pe- 
riod of time. To find such a snapshot, we used an SQL query similar to the one stated below. 
The following is an example of a simple SQL query when the database has only one snap- 
shot and the generalization to more than one snapshot is straightforward. We assume the 
following table with attributes: CyberGlove (time, snapshot) 

The following is a sample query: 


SELECT snapshot FROM CyberGlove c1, CyberGlove c2 
WHERE c2.snapshot IN ( 
    SELECT max(time), snapshot FROM CyberGlove c3 
    WHERE c3.time < c2.time 
    ORDER BY c3.time) AND 
    c1.snapshot - c2.snapshot < DELTA 
ORDER BY (c1.snapshot - c2.snapshot) 
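
The intent of the query, finding a snapshot whose sensor values barely differ from those of the preceding snapshot, can also be sketched procedurally. The function below is an illustrative assumption rather than the query's exact semantics: it compares each snapshot only with its immediate predecessor and measures change as the largest absolute difference over the sensors.

```python
def most_stable_snapshot(rows, delta):
    """rows: list of (time, sensor_values) sorted by time.

    Returns the (time, sensor_values) pair that changed least relative to its
    predecessor, provided the change is below delta; otherwise returns None.
    """
    best = None
    for (_, prev), (t, cur) in zip(rows, rows[1:]):
        change = max(abs(a - b) for a, b in zip(cur, prev))
        if change < delta and (best is None or change < best[0]):
            best = (change, t, cur)
    return None if best is None else (best[1], best[2])
```

For a static sign, the hand holds its posture for a while, so the snapshot selected this way is a reasonable representative of the sign as the subject intended it.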

Classification algorithms can be developed using incremental learning. The learner 
updates the rules, trees, and weights using the new session. The details of incremental learn- 
ing are beyond the scope of this chapter and have been addressed in the machine learning 
literature. The details of the data acquisition techniques have been addressed earlier. 


Results 


Tables 14-5, 14-6, and 14-7 depict the precision and recall values for each of the classifiers. 
Figures 14-3 and 14-4 summarize the tables by comparing the average recognition 
error for each sign using the three different classification techniques. 

Figure 14-3: Sign recognition error. 

Figure 14-4: Sign recognition error comparison. 

The naive Bayesian Classifier has the highest average accuracy with 50 training examples: 
84.66% (with a standard deviation of 2.94). In contrast, C4.5 has an average of 78% 
(SD = 8) and Back-Propagation Neural Network has an average accuracy of 79.82% (SD = 
7.92). Table 14-2 illustrates a comparison among the techniques. 


Table 14-2: Overall classification error. 

Analysis 


The Bayesian Classifier gives efficient and accurate results compared to the other 
classification techniques. The results of our experiments illustrate that the C4.5 Decision 
Tree may not be suited to the task of sign recognition. Both Neural Networks and C4.5 show a 
large amount of variation in their performance. However, C4.5 results are usually more 
interpretable and understandable. In contrast, the Neural Network architecture and procedure 
are not interpretable; the network is similar to a black box, for which we only have access 
to input and output. Our experiments indicate that all of the classifiers performed relatively 
well on signs “B,” “H,” “I,” and “L.” Inspecting the signs, it appears that it was intuitive 
for subjects to perform these signs most consciously. Considering all the signs as points in 
a 22-dimensional hyperspace and computing the Euclidean distances among them, we found that, 
on average, these four signs are quite far from the rest of the signs, which supports our 
observation. On the other hand, the letter “E” was quite close in distance to all the other 
signs, and hence all classifiers were confused in one way or another in recognizing the 
letter “E.” 
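
The distance analysis described above can be sketched as follows. This is a minimal illustration, not our actual analysis code: it assumes each sign is summarized by a single vector (for example, the mean sensor readings over its training snapshots), and the function name `average_distances` and the toy three-dimensional vectors are invented for the example.

```python
import math


def average_distances(sign_vectors):
    """Map each sign to its average Euclidean distance from all other signs.

    Requires at least two signs.
    """
    averages = {}
    for sign, vec in sign_vectors.items():
        dists = [
            math.dist(vec, other_vec)
            for other, other_vec in sign_vectors.items()
            if other != sign
        ]
        averages[sign] = sum(dists) / len(dists)
    return averages


# Toy 3-dimensional stand-ins for the 22 sensor values of each sign.
signs = {
    "B": (0.0, 0.0, 0.0),
    "E": (3.0, 4.0, 0.0),
    "L": (3.0, 4.0, 12.0),
}
avg = average_distances(signs)
most_isolated = max(avg, key=avg.get)  # sign farthest from the others on average
most_central = min(avg, key=avg.get)   # sign closest to the others on average
```

A sign with a large average distance is easy to separate from the rest, while a sign with a small average distance, like “E” in our experiments, is more likely to be confused with its neighbors.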

The performance variation of individual classifiers over the signs can be traced back to 
the performance characteristics of each classifier. A neural network inherently tries to draw 
crisp distinguishing boundaries between groups of signs in the 22-dimensional hyperspace. 
Hence, it distinguished all the signs made when the hand is in the horizontal position (i.e., 
“C,” “G,” and “H”) quite well (see Table 14-3). Note that although C4.5 was the best 
classifier for the letter “H,” this letter also had the lowest recognition error of all the 
letters with the neural net. With C4.5 and Bayesian Classifiers, the main assumption is that 
all features in a given space are independent. In general, any strong dependency increases 
the level of error for both methods, while a low degree of dependency among features might 
have a negligible effect. Further, since C4.5 produces decisions based on a set of “if-then” 
rules, it tends to be relatively rigid, resulting in a high standard deviation as well as a 
high overall error. Since the Bayesian Classifier decides based on the probability 
distribution of the input samples, it tends to perform quite well overall despite intuitive 
variations in the performance of signs by different subjects. 
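
To illustrate how the independence assumption enters a Bayesian classifier, the following is a minimal sketch of a naive Bayes classifier. It assumes a Gaussian likelihood for each sensor value, which is our choice for the example and not necessarily the model used in the experiments; the function names, the variance floor, and the toy two-sensor data are likewise hypothetical.

```python
import math
from collections import defaultdict


def train_naive_bayes(samples, labels):
    """Estimate a prior and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for y, rows in grouped.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # A small floor keeps the variance positive for constant features.
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-6)
            for col, m in zip(zip(*rows), means)
        ]
        model[y] = (n / len(samples), means, variances)
    return model


def classify(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    best_label, best_score = None, -math.inf
    for y, (prior, means, variances) in model.items():
        # Independence assumption: per-feature log likelihoods are summed.
        score = math.log(prior)
        for v, m, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = y, score
    return best_label


# Toy two-sensor snapshots; real snapshots would carry 22 sensor values.
train_x = [(0.1, 0.9), (0.2, 0.8), (0.15, 0.85), (0.9, 0.1), (0.8, 0.2), (0.85, 0.15)]
train_y = ["B", "B", "B", "L", "L", "L"]
model = train_naive_bayes(train_x, train_y)
prediction = classify(model, (0.12, 0.88))
```

Because each feature contributes its own term to the score, strongly correlated sensors are effectively counted more than once, which is one way a strong dependency among features can raise the error.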


Table 14-3: Best recognition technique for each sign. 




Table 14-4: Nearest neighbors for each sign in multidimensional space. 


Nearest    Farthest    Avg. Euclidean Distance 


As illustrated in Table 14-8,¹ we see that the best classifier for a sign is not necessarily 
the one that confuses the sign with fewer other signs. The choice of a classifier from the 
given classifiers therefore becomes very much application-dependent. To illustrate this 
observation, consider an example: C4.5 gives the highest average accuracy for sign L, but the 
other two techniques confuse L with fewer signs (one each) than C4.5 does (three signs). In 
sum, we show that even with a small pool of snapshots, with a fast learner such as the Naive 
Bayesian Classifier and an appropriate I/O design, we can achieve acceptable performance. 

¹ This table has been created using the previous tables. A percentage error > 0.00 for a sign 
is taken as confusion of that sign for the classifier. For Neural Networks, this threshold is 
fixed at 3.00. 
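
The counting rule described in the footnote to Table 14-8 can be sketched as follows. The function name `count_confusions` and the three-sign error rows are invented for illustration; only the rule itself (an error percentage above a threshold counts as a confusion, with the threshold at 0.00 in general and 3.00 for Neural Networks) comes from the footnote.

```python
def count_confusions(error_rows, threshold=0.0):
    """Count, for each sign, the other signs whose error percentage exceeds threshold.

    error_rows[sign][other] is the percentage of samples of `sign` that the
    classifier labeled as `other`.
    """
    return {
        sign: sum(
            1 for other, pct in row.items() if other != sign and pct > threshold
        )
        for sign, row in error_rows.items()
    }


# Made-up percentages, not values from our tables.
errors = {
    "G": {"G": 90.0, "H": 2.5, "L": 7.5},
    "H": {"G": 0.0, "H": 100.0, "L": 0.0},
    "L": {"G": 4.0, "H": 1.0, "L": 95.0},
}
strict = count_confusions(errors)                  # threshold 0.00
relaxed = count_confusions(errors, threshold=3.0)  # threshold used for Neural Networks
```

With the strict threshold, "G" and "L" are each confused with two other signs; with the relaxed threshold, each is confused with only one.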


Table 14-5: C4.5 Precision and recall. 

Table 14-7: Neural network precision (standard deviation) and recall. 
Sign        A      B      C      D      E      F      G      H      I      L
A (SD)  15.78   5.36   5.62   8.59  10.56   7.77   4.67   4.27   7.20   2.84
B (SD)   4.20   8.46   3.02   1.83   2.74   1.14   0.00   0.51   0.00   0.00
C (SD)   3.44  10.38   0.17   9.57   2.77   2.60   0.00   1.36   1.11   0.28
D (SD)   5.54   4.15   0.66  16.65   8.50   7.09   0.00   1.03   0.00   0.00
E (SD)   8.16   0.84   0.62  13.18  21.13  18.20   6.84   7.19  21.59   1.27
F        6.98   0.96   7.10   1.18   6.14  75.14   0.00   0.80   1.22   0.06
  (SD)  11.13   3.97   2.74   3.80   9.57  23.88   0.00   2.52   3.97   0.24
G        0.82   1.49    .67   3.61   1.57   0.00  81.00   1.65   0.00   7.94
  (SD)   3.72   3.46   3.56   7.27   5.38   0.00  17.87   4.31   0.00  12.94
H        1.20   1.06   0.55   1.29   0.02   0.12   0.00  95.51   0.00   0.00
  (SD)   4.14   3.05    .83   4.09   0.14   0.70   0.00   7.60   0.00   0.00
I        0.57   0.67   2.35   1.43   0.00   0.29   0.00   0.35  94.12   0.00
  (SD)   2.72   1.88   4.67   4.64   0.00   1.00   0.00   1.86   9.11   0.00
L        1.55   0.78     53   0.88   0.51   0.24   4.39   0.86   0.33  88.57
  (SD)   4.46   2.12   3.33   2.87   2.55   1.41   8.20   2.46   1.42   9.32


Table 14-8: Number of other signs with which each sign is confused for different classifiers. 


RELATED WORK 


Various research groups worldwide have been investigating the problem of sign 
recognition. We are aware of two main approaches. The machine-vision-based approaches 
analyze the video and image data of a hand in motion. These data include both the 2D and 3D 
position and the orientation of one or two hands. The haptics-based approaches analyze the 
haptic data from a glove. Quantified values of the various degrees of freedom for the hand 
constitute the data. These efforts have resulted in the development of devices such as the 
CyberGlove. 

We categorize the first approach based on the techniques employed. Darrell and Pent- 
land (1993) discuss vision-based recognition with “Template Matching.” Heap and Samaria 
(1995) employ Active Shape models. Several studies such as Martin and Crowley (1997) 
and Birk, Moeslund, and Madsen (1997) propose to use Principal Components Analysis. Yet 
another method of recognition using linear fingertips is described in Davis and Shah (1993). 
Banarase (1993) uses a neocognitron network. Like our group, various researchers have tried 
to recognize various sign languages all over the world using different methods. These in- 
clude the American (ASL), Australian (AUSLAN), Japanese (JSL), and Taiwanese (TWL) 
Sign Languages, to name a few. As for the most relevant, the task of ASL recognition has 
been pursued by numerous research groups (Starner, 1995; Starner & Pentland, 1996; Vo- 
gler & Metaxas, 1997, to cite a few who use the Hidden Markov Models). An excellent sur- 
vey of vision-based sign-recognition methods is provided in Wu and Huang (1999). 

Using gloves and haptic data, Fels and Hinton (1995) employed a VPL Glove to carry 
out sign recognition. Takahashi and Kishino (1991) also investigated the understanding of 
the Japanese Kana manual alphabet (consisting of 46 signs) using a VPL DataGlove. They 
constructed a table that designated the positions of individual fingers and joints that would 
indicate a particular hand shape. They reported that they could successfully interpret 30 of 
the 46 signs, while the remaining 16 could not be reliably identified, due to a variety of con- 
straints, such as the fact that they were moving gestures and that sufficient distinction could 
not be made in situations where fingertips touched. Sandberg (1997) provides extensive 
coverage and employs a combination of a Radial Basis Function Network and a Bayesian 
Classifier to classify a hybrid vocabulary of static and dynamic hand signs. One more 
variant, Recurrent Neural Networks, was used by Murakami and Taguchi (1991) for classifying 
Japanese sign language. We believe that the work by Salomon and Weissmann (1999) is the 
most relevant to our own, since it attempts to recognize signs using the same CyberGlove 
and back-propagation algorithm that we use. Hidden Markov Models are popular here too, 
which is reflected in Nam and Wohn (1996) and Lee and Yangsheng (1996). The latter is 
particularly relevant because it presents an application for learning signs through Hidden 
Markov Models, taking the data input from the CyberGlove. Wu, Wen, Yibo, Wei, and Bo 
(1998) use a combination of MLP-BP and HMM. Kadous (1995) used instance-based learn- 
ing to classify Australian Sign Language. Newby (1993) studied glove-based template 
matching using a simple sum-of-squares approach. Rubine (1991) proposed feature extraction, 
while Charaphayan and Marble (1994) compared the approaches of Dynamic Programming, 
HMM, and Recurrent Neural Networks. Dorner and Hagen (1994) are unique in 
that they have taken a holistic approach to the question of ASL interpretation. Hagen's work 
involved building a deductive database that successfully translates from a standardized form 
of ASL into spoken English. Other neural network algorithms—Radial Basis Function Net- 
work (RBFN), Orthogonal Least Squares and Self-Organizing Maps—have also been tested 
on various kinds of data-glove inputs, in Lin (1998) and Ishikawa & Matsumura (1999). 
Salomon and Weissmann (2000) go further to show that RBFN is better than MLP-BP for 
classifying dynamic signs in an evolutionary manner. 

Our work is distinct from all of the above-mentioned works because we provide a 
complete system including I/O unit, data acquisition module, database structure, and classifi- 
cation methods for ASL recognition. All our analysis is carried out on raw haptic data. We 
are the first to use Decision Tree for the analysis of haptic data. We are also the first to use 
Bayesian Classifier for raw (i.e., not preprocessed) haptic data analysis. Taking our 
framework into consideration, we are also the first to use Back-Propagation Neural Networks for 
the recognition of static signs. We propose to use Time-Delay Neural Networks for the rec- 
ognition of dynamic signs. 


CONCLUSION AND FUTURE WORK 


In this chapter, we analyzed three different classification techniques for sign language 
recognition. We showed that Decision Tree, Bayesian Classifier, and Neural Networks could 
be used for ASL recognition. Bayesian Classifier proved to be the fastest classification tech- 
nique among the three we evaluated. It also proved to have the best classification accuracy 
for static sign recognition. We carried out several preliminary experiments and the results of 
our experiments suggest that Bayesian Classifier can be used to develop a real-time sign 
language recognition system. However, more work needs to be carried out in order to estab- 
lish the validity of our results, which are very encouraging in the early stages of experimen- 
tation. There are many open questions, obstacles, and problems that need to be dealt with 
before we achieve an efficient, reliable, and applicable ASL recognition system. 

We intend to extend our work in several ways. First, in addition to Time-Delay Neural 
Networks, we also intend to investigate Evolving Fuzzy Neural Networks for the recognition 
of dynamic signs. It would be interesting to compare the effectiveness of these two tech- 
niques. Second, we want to use the analysis of haptic data that we pursued for the modeling 
of haptic data in a database. Our analysis would tell us what data we need to store at which 
level of abstraction for a given application. Third, we would like to analyze haptic data at the 
third level of abstraction, which requires us to analyze preprocessed haptic data. Finally, we 
propose to use shape recognition techniques for the recognition of dynamic signs based upon 
a fixed sign language vocabulary. Here, each dynamic sign would be considered to have a 
static part and an associated dynamic part. Using a shape to represent the dynamic part 
would let us view the dynamic sign in a time-independent manner. Such an approach would 
need an efficient technique for tracking the haptic device. 